Probabilistic topic decomposition of an eighteenth-century American newspaper

Authors

  • David J. Newman
  • Sharon Block
Abstract

vector space model for text data (Salton & McGill, 1983). In this model, each document in a corpus is represented by a term-frequency vector whose elements are the number of occurrences of each word in the vocabulary. Collectively, these term-frequency vectors form the document-word matrix representation of the corpus. All the methods we consider take this document-word matrix as their starting point. The classic information retrieval method, tf-idf (term-frequency inverse-document-frequency), is used in many search engines today. Despite its popularity, tf-idf does not handle synonymy and polysemy. Deerwester, Dumais, Furnas, Landauer, and Harshman (1990) devised Latent Semantic Analysis (LSA) to address this deficiency. Their method for detecting relevant documents based on words in queries improved upon simple word matching. Their association of words with documents (what they called semantic structure) moves us closer to the notion of topics. For example, LSA allows one to compute whether two documents are topically similar, even if the two documents have no words in common.

There has been a huge increase in the number of historical primary sources available online. Yet little work has been done on processing, modeling, or analyzing these recently available corpora. Previous studies of historic document collections were limited by the number of items a researcher could analyze in a reasonable amount of time. For instance, Clark and Wetherell (1989) analyzed the Pennsylvania Gazette by sampling less than 10% of the total number of articles in just a 33-year period. Other authors analyzed a single category of a newspaper’s content, such as
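The document-word matrix and tf-idf weighting described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the three toy documents are invented, tokenization is plain whitespace splitting, and the weighting uses the common `count * log(N / df)` variant of tf-idf. The helper names `build_document_word_matrix`, `tfidf`, and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def build_document_word_matrix(docs):
    """Return (vocabulary, matrix); matrix[d][w] is the count of word w in document d."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


def tfidf(matrix):
    """Weight raw counts by log(N / df), one common tf-idf variant (an assumption)."""
    n_docs = len(matrix)
    n_words = len(matrix[0])
    # Document frequency: number of documents containing each word.
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_words)]
    return [
        [row[j] * math.log(n_docs / doc_freq[j]) for j in range(n_words)]
        for row in matrix
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented toy corpus, loosely in the style of newspaper notices.
docs = [
    "ship arrived with cloth and rum",
    "runaway servant reward offered",
    "ship sailed with rum",
]
vocabulary, counts = build_document_word_matrix(docs)
weights = tfidf(counts)

# Documents 0 and 2 share words, so their similarity is positive.
sim_shared = cosine_similarity(weights[0], weights[2])
# Documents 0 and 1 share no words, so word-based similarity is exactly zero.
sim_disjoint = cosine_similarity(weights[0], weights[1])
```

Because documents 0 and 1 have no words in common, any similarity computed directly on the document-word matrix is zero for that pair. This is the limitation that LSA, as summarized above, is meant to address.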

Similar resources

Newspaper reporting and attitudes to crime and justice in late eighteenth and early nineteenth century London

As other sources of printed information about crime, such as the Ordinary’s Accounts of the lives of executed criminals, lost their audience in the final third of the eighteenth century, newspapers came increasingly to dominate printed discussions of crime. However, no substantial study of the overall nature of newspaper reporting on crime and criminal justice issues has yet been undertaken. By...

Luxury in the Eighteenth Century: Debates, Desires and Delectable Goods

Luxury in the Eighteenth Century is a welcome collection of essays on a very important topic. Since the 1982 appearance of the path-breaking The Birth of a Consumer Society, studies of consumption in eighteenth-century Western Europe have proliferated to confirm the thesis that the century experienced a dramatic surge in the production and consumption of goods.(1) This new, handsomely-edited vo...

News from the Hesburgh Libraries of Notre Dame

The recent acquisition of a major microfilm collection titled “Early English Newspapers” has added 1,412 newspapers and broadsides to the Hesburgh Libraries’ holdings. This extraordinary purchase was made possible by the discernment and generosity of a group of University benefactors known as “The President’s Circle.” Many faculty and students are unaware of the wealth of primary sources avail...

Journal:
  • JASIST

Volume: 57

Publication date: 2006